Abstract: Tight bounds for several symmetric divergence measures
are introduced, given in terms of the total variation
distance. Each of these bounds is attained by a pair of 2- or
3-element probability distributions. An application of these bounds
to lossless source coding is provided, refining and improving a
certain bound by Csiszár. A new inequality relating $f$-divergences
is derived, and its use is exemplified. The last section of this
conference paper, as well as some new remarks that are linked to
new references, is not included in the recent journal paper [16].

I. Introduction and Preliminaries 

Divergence measures are widely used in information theory,
machine learning, statistics, and other theoretical and applied
branches of mathematics (see, e.g., [?], [?], [?]). The class
of $f$-divergences forms an important class of divergence
measures. Their properties, including relations to statistical
tests and estimators, were studied, e.g., in [?] and [?].

In [?], Gilardoni studied the problem of minimizing an
arbitrary symmetric $f$-divergence for a given total variation
distance (these terms are defined later in this section), providing
a closed-form solution of this optimization problem. In a
follow-up paper by the same author [?], Pinsker’s and Vajda’s
type inequalities were studied for symmetric $f$-divergences,
and the issue of obtaining lower bounds on $f$-divergences for
a fixed total variation distance was further studied.

One of the main results in [10] was a further derivation of
a simple closed-form lower bound on the relative entropy in
terms of the total variation distance. The relative entropy is an
asymmetric $f$-divergence, as clarified later in this section.
This lower bound on the relative entropy suggests an improvement
over Pinsker’s and Vajda’s inequalities.
A simple and reasonably tight closed-form upper bound on the
infimum of the relative entropy, in terms of the total variation
distance, was also derived in [10].
An exact characterization of the minimum of the relative
entropy subject to a fixed total variation distance has been
derived in [?] and [9].

Sharp inequalities for $f$-divergences were recently studied
in [11] as a general problem of maximizing or minimizing
an arbitrary $f$-divergence between two probability measures
subject to a finite number of inequality constraints on other
$f$-divergences. The main result stated in [11] is that such
infinite-dimensional optimization problems are equivalent to
optimization problems over finite-dimensional spaces, where
the latter are numerically solvable.


The total variation distance has been further studied from
an information-theoretic perspective by Verdú [?], providing
upper and lower bounds on the total variation distance between
two probability measures $P$ and $Q$ in terms of the distribution
of the relative information $\log \frac{dP}{dQ}(X)$ and
$\log \frac{dP}{dQ}(Y)$ where $X$ and $Y$ are distributed
according to $P$ and $Q$, respectively.

Following previous work, tight bounds on symmetric
$f$-divergences and related distances are introduced in this paper.
An application of these bounds to lossless source coding is
provided, refining and improving a certain bound by Csiszár
[?]. The material in this conference paper appears in the
recently published journal paper by the same author [16].
However, we also provide in this conference paper a new
inequality relating $f$-divergences, and its use is exemplified;
this material is not included in the journal paper [16] since it
does not necessarily refer to symmetric $f$-divergences.

The paper is organized as follows: tight bounds for several
symmetric divergence measures, which are either symmetric
$f$-divergences or related symmetric distances, are introduced
without proofs in Section II; these bounds are expressed in
terms of the total variation distance. An application for the
derivation of an improved and refined bound in the context
of lossless source coding is provided in Section III. The full
version of this work, including proofs of the tight bounds
in Section II, appears in [16]. Section IV provides a new
inequality that relates $f$-divergences; this inequality
is proved since it is not included in the journal paper [16].

We end this section by introducing some preliminaries. 

Definition 1: Let $P$ and $Q$ be two probability distributions
with a common $\sigma$-algebra $\mathcal{F}$. The total variation
distance between $P$ and $Q$ is
$d_{\mathrm{TV}}(P,Q) \triangleq \sup_{A \in \mathcal{F}} |P(A) - Q(A)|$.

If $P$ and $Q$ are defined on a countable set, it is simplified to

$$d_{\mathrm{TV}}(P,Q) = \frac{1}{2} \sum_x |P(x) - Q(x)| = \frac{\|P - Q\|_1}{2}. \tag{1}$$
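As a small illustration of the countable-set form of the total variation distance in Definition 1, the following sketch evaluates it for finite distributions given as lists of probabilities. The function name `total_variation` and the example value of `eps` are illustrative choices, not taken from the paper.

```python
def total_variation(p, q):
    """Total variation distance between two distributions on the same finite set.

    p and q are sequences of probabilities indexed by the same symbols.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# A 2-element pair whose total variation distance equals eps by construction.
eps = 0.2
P = [(1 - eps) / 2, (1 + eps) / 2]
Q = [(1 + eps) / 2, (1 - eps) / 2]
print(total_variation(P, Q))
```

The factor 0.5 makes the distance take values in [0, 1], which is the range assumed for the parameter of the bounds in Section II.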

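Definition 2: Let $f\colon (0,\infty) \to \mathbb{R}$ be a convex function
with $f(1) = 0$, and let $P$ and $Q$ be two probability distributions.
The $f$-divergence from $P$ to $Q$ is defined by

$$D_f(P\|Q) \triangleq \sum_x Q(x)\, f\!\left(\frac{P(x)}{Q(x)}\right) \tag{2}$$

with the convention that

$$0 f\!\left(\frac{0}{0}\right) = 0, \quad f(0) = \lim_{t \to 0^+} f(t), \quad 0 f\!\left(\frac{a}{0}\right) = \lim_{t \to 0^+} t f\!\left(\frac{a}{t}\right) = a \lim_{u \to \infty} \frac{f(u)}{u}, \quad \forall\, a > 0.$$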
Definition 3: An $f$-divergence is said to be symmetric if
$D_f(P\|Q) = D_f(Q\|P)$ for every $P$ and $Q$.

Symmetric $f$-divergences include (among others) the
squared Hellinger distance where

$$f(t) = (\sqrt{t} - 1)^2, \quad D_f(P\|Q) = \sum_x \left(\sqrt{P(x)} - \sqrt{Q(x)}\right)^2,$$

and the total variation distance in (1) where $f(t) = \frac{1}{2}|t - 1|$.
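The notion of a symmetric $f$-divergence can be checked numerically. The sketch below assumes the usual form $D_f(P\|Q) = \sum_x Q(x) f(P(x)/Q(x))$ for strictly positive finite distributions, and compares both argument orders for the squared Hellinger distance, the total variation distance, and (as an asymmetric contrast) $f(t) = t \log t$. All function names and the example distributions are illustrative.

```python
import math


def f_divergence(f, p, q):
    """D_f(P||Q) = sum_x Q(x) f(P(x)/Q(x)); assumes all entries are positive."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))


def hellinger_f(t):
    return (math.sqrt(t) - 1.0) ** 2      # generates the squared Hellinger distance


def tv_f(t):
    return 0.5 * abs(t - 1.0)             # generates the total variation distance


def kl_f(t):
    return t * math.log(t)                # generates the relative entropy


P = [0.5, 0.3, 0.2]
Q = [0.25, 0.25, 0.5]

for name, f in [("Hellinger", hellinger_f), ("TV", tv_f), ("KL", kl_f)]:
    forward = f_divergence(f, P, Q)
    backward = f_divergence(f, Q, P)
    print(name, forward, backward, math.isclose(forward, backward))
```

For this pair, the first two divergences agree in both directions while the third does not, which is the distinction that Definition 3 draws.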

An $f$-divergence is symmetric if and only if the function $f$
satisfies the equality (see [10, p. 765])

$$f(u) = u\, f\!\left(\frac{1}{u}\right) + a(u - 1), \quad \forall\, u \in (0,\infty) \tag{3}$$

for some constant $a$. If $f$ is differentiable at $u = 1$ then a
differentiation of both sides of equality (3) at $u = 1$ gives that
$a = 2 f'(1)$.

Note that the relative entropy (a.k.a. the Kullback-Leibler
divergence) $D(P\|Q) \triangleq \sum_x P(x) \log \frac{P(x)}{Q(x)}$ is an
$f$-divergence with $f(t) = t \log(t)$, $t > 0$; its dual, $D(Q\|P)$,
is an $f$-divergence with $f(t) = -\log(t)$, $t > 0$; clearly, it is
an asymmetric $f$-divergence since, in general, $D(P\|Q) \neq D(Q\|P)$.

The following result, which was derived by Gilardoni (see
[?], [10]), refers to the infimum of a symmetric $f$-divergence
for a fixed value of the total variation distance:

Theorem 1: Let $f\colon (0,\infty) \to \mathbb{R}$ be a convex function with
$f(1) = 0$, and assume that $f$ is twice differentiable. Let

$$L_{D_f}(\varepsilon) \triangleq \inf_{P,Q\colon\, d_{\mathrm{TV}}(P,Q) = \varepsilon} D_f(P\|Q), \quad \forall\, \varepsilon \in [0,1]$$

be the infimum of the $f$-divergence for a given total variation
distance. If $D_f$ is a symmetric $f$-divergence, and $f$ is
differentiable at $u = 1$, then

$$L_{D_f}(\varepsilon) = (1 - \varepsilon)\, f\!\left(\frac{1 + \varepsilon}{1 - \varepsilon}\right) - 2 f'(1)\, \varepsilon, \quad \forall\, \varepsilon \in [0,1].$$

II. Tight Bounds on Symmetric Divergence Measures

The following section introduces tight bounds for several
symmetric divergence measures (some of which are not
$f$-divergences) for a fixed value of the total variation distance.

A. Tight Bounds on the Bhattacharyya Coefficient

Definition 4: Let $P$ and $Q$ be two probability distributions
that are defined on the same set. The Bhattacharyya coefficient
between $P$ and $Q$ is given by $Z(P,Q) \triangleq \sum_x \sqrt{P(x)\, Q(x)}$.

Proposition 1: Let $P$ and $Q$ be two probability distributions.
Then, for a fixed value $\varepsilon \in [0,1]$ of the total variation
distance (i.e., if $d_{\mathrm{TV}}(P,Q) = \varepsilon$), the respective Bhattacharyya
coefficient satisfies the inequality $1 - \varepsilon \leq Z(P,Q) \leq \sqrt{1 - \varepsilon^2}$.

Both upper and lower bounds are tight: the upper bound
is attained by the pair of 2-element probability distributions
$P = \left(\frac{1-\varepsilon}{2}, \frac{1+\varepsilon}{2}\right)$ and $Q = \left(\frac{1+\varepsilon}{2}, \frac{1-\varepsilon}{2}\right)$, and the lower bound
is attained by the pair of 3-element probability distributions
$P = (\varepsilon, 1 - \varepsilon, 0)$ and $Q = (0, 1 - \varepsilon, \varepsilon)$.

Remark 1: Although derived independently in this work,
Proposition 1 is a known result in quantum information theory
(on the relation between the trace distance and fidelity [?]).

B. A Tight Bound on the Chernoff Information

Definition 5: The Chernoff information between two probability
distributions $P$ and $Q$, defined on the same set, is

$$C(P,Q) \triangleq -\min_{\lambda \in [0,1]} \log \left( \sum_x P(x)^{\lambda}\, Q(x)^{1-\lambda} \right)$$

where throughout this paper, the logarithms are on base $e$.
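Proposition 2: Let

$$C(\varepsilon) \triangleq \min_{P,Q\colon\, d_{\mathrm{TV}}(P,Q) = \varepsilon} C(P,Q), \quad \forall\, \varepsilon \in [0,1] \tag{4}$$

be the minimum of the Chernoff information for a fixed value
$\varepsilon \in [0,1]$ of the total variation distance. This minimum indeed
exists, and it is equal to

$$C(\varepsilon) = \begin{cases} -\frac{1}{2} \log(1 - \varepsilon^2) & \text{if } \varepsilon \in [0,1) \\ +\infty & \text{if } \varepsilon = 1. \end{cases}$$

For $\varepsilon \in [0,1)$, it is achieved by the pair of 2-element probability
distributions $P = \left(\frac{1-\varepsilon}{2}, \frac{1+\varepsilon}{2}\right)$ and $Q = \left(\frac{1+\varepsilon}{2}, \frac{1-\varepsilon}{2}\right)$.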

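Outline of the proof: Definition 5 with a possibly sub-optimal
value of $\lambda = \frac{1}{2}$, and Proposition 1 yield that

$$C(P,Q) \geq -\log \left( \sum_x \sqrt{P(x)\, Q(x)} \right) = -\log Z(P,Q) \geq -\frac{1}{2} \log \left( 1 - \big(d_{\mathrm{TV}}(P,Q)\big)^2 \right). \tag{5}$$

Consequently, from (4), $C(\varepsilon) \geq -\frac{1}{2} \log(1 - \varepsilon^2)$ for $\varepsilon \in [0,1)$.
It can be verified that the lower bound on $C(P,Q)$ is achieved
for $P = \left(\frac{1-\varepsilon}{2}, \frac{1+\varepsilon}{2}\right)$ and $Q = \left(\frac{1+\varepsilon}{2}, \frac{1-\varepsilon}{2}\right)$.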
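The chain of inequalities in the proof outline can be examined numerically. The sketch below assumes the standard definitions $Z(P,Q) = \sum_x \sqrt{P(x) Q(x)}$ and $C(P,Q) = -\min_{\lambda \in [0,1]} \log \sum_x P(x)^{\lambda} Q(x)^{1-\lambda}$, and uses a ternary search over $\lambda$ (valid because the logarithm of the sum is convex in $\lambda$). The function names and example distributions are illustrative, and strictly positive distributions are assumed.

```python
import math


def total_variation(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def bhattacharyya(p, q):
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))


def chernoff_information(p, q, iters=200):
    """-min over lam in [0,1] of log sum_x P(x)^lam Q(x)^(1-lam); positive entries assumed."""

    def log_sum(lam):
        return math.log(sum(pi ** lam * qi ** (1.0 - lam) for pi, qi in zip(p, q)))

    lo, hi = 0.0, 1.0
    for _ in range(iters):                 # ternary search: log_sum is convex in lam
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_sum(m1) < log_sum(m2):
            hi = m2
        else:
            lo = m1
    return -log_sum(0.5 * (lo + hi))


def chain(p, q):
    """Return the three quantities compared in the proof outline, left to right."""
    eps = total_variation(p, q)
    return (chernoff_information(p, q),
            -math.log(bhattacharyya(p, q)),
            -0.5 * math.log(1.0 - eps ** 2))


eps = 0.6
extremal = ([(1 - eps) / 2, (1 + eps) / 2], [(1 + eps) / 2, (1 - eps) / 2])
generic = ([0.5, 0.3, 0.2], [0.25, 0.25, 0.5])
print(chain(*extremal))   # the three values coincide for the 2-element pair
print(chain(*generic))    # the three values are non-increasing for a generic pair
```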

Remark 2: A geometric interpretation of the minimum of
the Chernoff information subject to a minimal total variation
distance has been recently provided in [?, Section 3].

Remark 3 (An Application): From (5), a lower bound on
the total variation distance implies a lower bound on the
Chernoff information; consequently, it provides an upper bound on
the best achievable Bayesian probability of error for binary
hypothesis testing. This approach has been recently used in
[?] to obtain a lower bound on the Chernoff information for
studying a communication problem that is related to
channel-code detection via the likelihood ratio test.










C. A Tight Bound on the Capacitory Discrimination 

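The capacitory discrimination (a.k.a. the Jensen-Shannon
divergence) is defined as follows:

Definition 6: Let $P$ and $Q$ be two probability distributions.
The capacitory discrimination between $P$ and $Q$ is given by

$$\overline{C}(P,Q) \triangleq D\!\left(P \,\Big\|\, \frac{P+Q}{2}\right) + D\!\left(Q \,\Big\|\, \frac{P+Q}{2}\right) = 2 \left[ H\!\left(\frac{P+Q}{2}\right) - \frac{H(P) + H(Q)}{2} \right].$$

This divergence measure was studied, e.g., in [?] and [?].

Proposition 3: For every $\varepsilon \in [0,1)$,

$$\min_{P,Q\colon\, d_{\mathrm{TV}}(P,Q) = \varepsilon} \overline{C}(P,Q) = 2\, d\!\left(\frac{1-\varepsilon}{2} \,\Big\|\, \frac{1}{2}\right) \tag{6}$$

and it is achieved by the 2-element probability distributions
$P = \left(\frac{1-\varepsilon}{2}, \frac{1+\varepsilon}{2}\right)$ and $Q = \left(\frac{1+\varepsilon}{2}, \frac{1-\varepsilon}{2}\right)$. In (6),

$$d(p\|q) \triangleq p \log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q}, \quad p, q \in [0,1],$$

with the convention that $0 \log 0 = 0$.

Outline of the proof: In [11, p. 119], $\overline{C}(P,Q) = D_f(P\|Q)$
with $f(t) = t \log t - (t+1) \log(1+t) + 2 \log 2$ for $t > 0$. This
is a symmetric $f$-divergence where $f$ is convex with $f(1) = 0$,
$f'(1) = -\log 2$. Eq. (6) follows from Theorem 1.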
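A quick numerical check of Proposition 3 is sketched below. It assumes the capacitory discrimination is the sum of the relative entropies from $P$ and from $Q$ to the mixture $(P+Q)/2$, and that the minimum in (6) equals $2\, d\big(\frac{1-\varepsilon}{2} \,\|\, \frac{1}{2}\big)$ with $d(\cdot\|\cdot)$ the binary relative entropy. Function names are illustrative.

```python
import math


def kl(p, q):
    """Relative entropy D(P||Q) in nats, with the convention 0 log 0 = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def capacitory_discrimination(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]   # the mixture (P + Q) / 2
    return kl(p, m) + kl(q, m)


def binary_kl(p, q):
    """d(p||q) with the convention 0 log 0 = 0; requires 0 < q < 1."""
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total


def min_capacitory(eps):
    return 2 * binary_kl((1 - eps) / 2, 0.5)


eps = 0.3
P2 = [(1 - eps) / 2, (1 + eps) / 2]
Q2 = [(1 + eps) / 2, (1 - eps) / 2]
P3 = [eps, 1 - eps, 0.0]            # another pair with the same total variation
Q3 = [0.0, 1 - eps, eps]

print(capacitory_discrimination(P2, Q2), min_capacitory(eps))  # equal up to rounding
print(capacitory_discrimination(P3, Q3))                       # larger than the minimum
```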

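D. A Tight Bound on Jeffreys’ Divergence

Definition 7: Let $P$ and $Q$ be two probability distributions.
Jeffreys’ divergence [?] is a symmetrized version of the
relative entropy, which is defined as

$$J(P,Q) \triangleq \frac{D(P\|Q) + D(Q\|P)}{2}. \tag{7}$$

Proposition 4: For every $\varepsilon \in [0,1)$,

$$\min_{P,Q\colon\, d_{\mathrm{TV}}(P,Q) = \varepsilon} J(P,Q) = \varepsilon \log \left( \frac{1+\varepsilon}{1-\varepsilon} \right). \tag{8}$$

The minimum in (8) is achieved by the pair of 2-element
distributions $P = \left(\frac{1-\varepsilon}{2}, \frac{1+\varepsilon}{2}\right)$ and $Q = \left(\frac{1+\varepsilon}{2}, \frac{1-\varepsilon}{2}\right)$.

Outline of the proof: Jeffreys’ divergence can be expressed
as a symmetric $f$-divergence where $f(t) = \frac{1}{2}(t - 1) \log t$ for
$t > 0$. Note that $f$ is convex, and $f(1) = f'(1) = 0$. Eq. (8)
follows from Theorem 1.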
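The way Theorem 1 produces the closed forms in Propositions 3 and 4 can be traced numerically. The sketch below assumes that the theorem reads $L_{D_f}(\varepsilon) = (1-\varepsilon) f\big(\frac{1+\varepsilon}{1-\varepsilon}\big) - 2 f'(1)\varepsilon$, and evaluates it for the two generating functions named in the proof outlines. The helper name `symmetric_infimum` is hypothetical.

```python
import math


def symmetric_infimum(f, f_prime_at_1, eps):
    """(1 - eps) f((1 + eps)/(1 - eps)) - 2 f'(1) eps, for 0 <= eps < 1."""
    return (1 - eps) * f((1 + eps) / (1 - eps)) - 2 * f_prime_at_1 * eps


def jeffreys_f(t):
    return 0.5 * (t - 1) * math.log(t)            # f(1) = 0 and f'(1) = 0


def capacitory_f(t):
    # f(1) = 0 and f'(1) = -log 2
    return t * math.log(t) - (t + 1) * math.log(1 + t) + 2 * math.log(2)


def jeffreys(p, q):
    """J(P,Q) = (D(P||Q) + D(Q||P)) / 2 for strictly positive p and q."""
    return 0.5 * sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))


eps = 0.4
P = [(1 - eps) / 2, (1 + eps) / 2]
Q = [(1 + eps) / 2, (1 - eps) / 2]

jeffreys_closed_form = eps * math.log((1 + eps) / (1 - eps))
capacitory_closed_form = (1 + eps) * math.log(1 + eps) + (1 - eps) * math.log(1 - eps)

print(symmetric_infimum(jeffreys_f, 0.0, eps), jeffreys_closed_form, jeffreys(P, Q))
print(symmetric_infimum(capacitory_f, -math.log(2), eps), capacitory_closed_form)
```

The second comparison uses the identity $2\, d\big(\frac{1-\varepsilon}{2} \,\|\, \frac{1}{2}\big) = (1+\varepsilon)\log(1+\varepsilon) + (1-\varepsilon)\log(1-\varepsilon)$, which follows by expanding the binary relative entropy.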

III. A Bound for Lossless Source Coding 

We illustrate in the following a use of Proposition 4 for
lossless source coding. This tightens, and also refines under a
certain condition, a bound by Csiszár [?].

Consider a memoryless and stationary source with alphabet
$\mathcal{U}$ that emits symbols according to a probability distribution
$P$, and assume a uniquely decodable (UD) code with an
alphabet of size $d$. It is well known that such a UD code
achieves the entropy of the source if and only if the length
$l(u)$ of the codeword that is assigned to each symbol $u \in \mathcal{U}$
satisfies the equality $l(u) = -\log_d P(u)$ for every $u \in \mathcal{U}$.
This corresponds to a dyadic source where, for every $u \in \mathcal{U}$,
we have $P(u) = d^{-n_u}$ with a natural number $n_u$; in this
case, $l(u) = n_u$ for every symbol $u \in \mathcal{U}$. Let $\overline{L} \triangleq \mathbb{E}[L]$
designate the average length of the codewords, and
$H_d(U) \triangleq -\sum_{u \in \mathcal{U}} P(u) \log_d P(u)$ be the entropy of the source (to the
base $d$). Furthermore, let $c_{d,l} \triangleq \sum_{u \in \mathcal{U}} d^{-l(u)}$. According to
the Kraft-McMillan inequality, the inequality $c_{d,l} \leq 1$ holds
in general for UD codes, and the equality $c_{d,l} = 1$ holds if the
code achieves the entropy of the source (i.e., $\overline{L} = H_d(U)$).

Define the probability distribution
$Q_{d,l}(u) \triangleq \frac{1}{c_{d,l}}\, d^{-l(u)}$
for every $u \in \mathcal{U}$, and let $\Delta_d \triangleq \overline{L} - H_d(U)$ designate the
redundancy of the code. Note that for a UD code that achieves
the entropy of the source, its probability distribution $P$ is equal
to $Q_{d,l}$ (since $c_{d,l} = 1$, and $P(u) = d^{-l(u)}$ for every $u \in \mathcal{U}$).

In [?], a generalization for UD source codes has been
studied by a derivation of an upper bound on the $L_1$ norm
between the two probability distributions $P$ and $Q_{d,l}$ as a
function of the redundancy $\Delta_d$ of the code. To this end,
straightforward calculation shows that the relative entropy
from $P$ to $Q_{d,l}$ is given by

$$D(P\|Q_{d,l}) = \Delta_d \log d + \log(c_{d,l}). \tag{9}$$

The interest in [?] is in getting an upper bound that only
depends on the (average) redundancy $\Delta_d$ of the code, but is
independent of the specific distribution of the length of each
codeword. Hence, since the Kraft-McMillan inequality states
that $c_{d,l} \leq 1$ for general UD codes, it is concluded in [?] that

$$D(P\|Q_{d,l}) \leq \Delta_d \log d. \tag{10}$$
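The identity for $D(P\|Q_{d,l})$ and the resulting bound can be checked on a small example. The sketch below assumes $c_{d,l} = \sum_u d^{-l(u)}$, $Q_{d,l}(u) = d^{-l(u)} / c_{d,l}$ and $\Delta_d = \overline{L} - H_d(U)$; the source distribution and the codeword lengths (those of a binary Shannon code) are illustrative choices, as are the function names.

```python
import math


def code_statistics(p, lengths, d=2):
    """Return (c, q, redundancy) for a source p and codeword lengths over a d-ary alphabet."""
    c = sum(d ** (-l) for l in lengths)                    # Kraft sum c_{d,l}
    q = [d ** (-l) / c for l in lengths]                   # induced distribution Q_{d,l}
    avg_len = sum(pi * l for pi, l in zip(p, lengths))     # average codeword length
    entropy_d = -sum(pi * math.log(pi, d) for pi in p if pi > 0)
    return c, q, avg_len - entropy_d


def kl(p, q):
    """Relative entropy D(P||Q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.25, 0.15, 0.10]
lengths = [1, 2, 3, 4]          # l(u) = ceil(log2(1 / P(u))) for this source
d = 2

c, q, redundancy = code_statistics(p, lengths, d)
lhs = kl(p, q)
rhs = redundancy * math.log(d) + math.log(c)
print(c, redundancy)
print(lhs, rhs)                                   # the two sides of the identity
print(lhs <= redundancy * math.log(d))            # the bound obtained by dropping log(c)
```

Since $c_{d,l} \leq 1$, the term $\log(c_{d,l})$ is non-positive, which is why dropping it yields an upper bound.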


Consequently, it follows from Pinsker’s inequality that

$$\sum_{u \in \mathcal{U}} \big| P(u) - Q_{d,l}(u) \big| \leq \min \left\{ \sqrt{2 \Delta_d \log d},\; 2 \right\} \tag{11}$$

where it is also taken into account that, from the triangle
inequality, the sum on the left-hand side of (11) cannot
exceed 2. This inequality is indeed consistent with the fact
that the probability distributions $P$ and $Q_{d,l}$ coincide when
$\Delta_d = 0$ (i.e., for a UD code which achieves the entropy of
the source).

At this point we deviate from the analysis in [?]. One
possible improvement of the bound in (11) follows by replacing
Pinsker’s inequality with the result in [?], i.e., by taking
into account the exact parametrization of the infimum of the
relative entropy for a given total variation distance. This gives
the following tightened bound:

$$\sum_{u \in \mathcal{U}} \big| P(u) - Q_{d,l}(u) \big| \leq 2\, L^{-1}(\Delta_d \log d) \tag{12}$$

where $L^{-1}$ is the inverse function of $L$, given as follows [?]:

$$L(\varepsilon) \triangleq \inf_{P,Q\colon\, d_{\mathrm{TV}}(P,Q) = \varepsilon} D(P\|Q) = \min_{\beta \in [\varepsilon - 1,\, 1 - \varepsilon]} \left\{ \left( \frac{\varepsilon + 1 - \beta}{2} \right) \log \left( \frac{\beta - 1 - \varepsilon}{\beta - 1 + \varepsilon} \right) + \left( \frac{\beta + 1 - \varepsilon}{2} \right) \log \left( \frac{\beta + 1 - \varepsilon}{\beta + 1 + \varepsilon} \right) \right\}. \tag{13}$$

It can be verified that the minimization w.r.t. $\beta$ in
(13) can be restricted to the interval $[\varepsilon - 1, 0]$; the minimum
is calculated numerically.
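One way to carry out this numerical minimization is sketched below. It assumes that the objective in (13) is the binary relative entropy between $(p, 1-p)$ and $(p - \varepsilon, 1 - p + \varepsilon)$ with $p = (1 + \varepsilon - \beta)/2$, which is convex in $\beta$, so a ternary search over $[\varepsilon - 1, 0]$ applies. The inverse $L^{-1}$ is then obtained by bisection, relying on $L$ being increasing in $\varepsilon$. The function names, iteration counts and example value of $x$ are illustrative assumptions, not the author's procedure.

```python
import math


def binary_kl(p, q):
    """d(p||q) with the convention 0 log 0 = 0; requires 0 < q < 1."""
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total


def min_relative_entropy(eps, iters=100):
    """L(eps) for 0 <= eps < 1, minimizing over beta in [eps - 1, 0] by ternary search."""
    if eps <= 0.0:
        return 0.0

    def objective(beta):
        p = (1.0 + eps - beta) / 2.0        # first mass of the binary P
        return binary_kl(p, p - eps)        # the binary Q has first mass p - eps

    lo, hi = eps - 1.0, 0.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            hi = m2
        else:
            lo = m1
    return objective(0.5 * (lo + hi))


def inverse_min_relative_entropy(x, iters=60):
    """L^{-1}(x) by bisection, assuming L is increasing on [0, 1)."""
    lo, hi = 0.0, 1.0 - 1e-12
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if min_relative_entropy(mid) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


x = 0.0743                                   # an example value of Delta_d * log(d), in nats
print(2 * inverse_min_relative_entropy(x))   # tightened bound (12)
print(min(math.sqrt(2 * x), 2.0))            # Pinsker-based bound (11)
```

By Pinsker's inequality, $L(\varepsilon) \geq 2\varepsilon^2$, so the first printed value cannot exceed the second.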

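In the following, the utility of Proposition 4 is shown by
refining the bound in (12). Let $\delta(u) \triangleq l(u) + \log_d P(u)$ for
every $u \in \mathcal{U}$. Calculation of the dual divergence gives

$$D(Q_{d,l}\|P) = -\log(c_{d,l}) - \frac{\log d}{c_{d,l}}\; \mathbb{E}\!\left[ \delta(U)\, d^{-\delta(U)} \right] \tag{14}$$

and the combination of (7), (9) and (14) yields that

$$J(P, Q_{d,l}) = \frac{1}{2} \left[ \Delta_d \log d - \frac{\log d}{c_{d,l}}\; \mathbb{E}\!\left[ \delta(U)\, d^{-\delta(U)} \right] \right]. \tag{15}$$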


For the simplicity of the continuation of the analysis, we
restrict our attention to UD codes that satisfy the condition

$$l(u) \geq \left\lceil \log_d \frac{1}{P(u)} \right\rceil, \quad \forall\, u \in \mathcal{U}. \tag{16}$$

In general, it excludes Huffman codes; nevertheless, it is
satisfied by some other important UD codes such as the Shannon
code, Shannon-Fano-Elias code, and arithmetic coding. Since
(16) is equivalent to the condition that $\delta$ is non-negative on
$\mathcal{U}$, it follows from (15) that

$$J(P, Q_{d,l}) \leq \frac{\Delta_d \log d}{2} \tag{17}$$

so the upper bound on Jeffreys’ divergence in (17) is half
the upper bound on the relative entropy in (10). This
is partially because the term $\log c_{d,l}$ cancels out along the
derivation of the bound in (17), in contrast to the derivation
of the bound in (10) where this term was removed from the
bound in order to avoid its dependence on the length of the
codeword for each individual symbol.
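The expression for $J(P, Q_{d,l})$ in terms of $\delta(u) = l(u) + \log_d P(u)$ and the resulting upper bound can be checked on a small example. The sketch below assumes $J$ is half the sum of the two relative entropies and uses the lengths of a binary Shannon code, for which $\delta$ is non-negative; the source distribution and function name are illustrative.

```python
import math


def jeffreys_terms(p, lengths, d=2):
    """Return (J computed directly, J via the delta-based formula, redundancy, delta)."""
    c = sum(d ** (-l) for l in lengths)                         # Kraft sum c_{d,l}
    q = [d ** (-l) / c for l in lengths]                        # Q_{d,l}
    delta = [l + math.log(pi, d) for pi, l in zip(p, lengths)]  # delta(u)
    redundancy = sum(pi * di for pi, di in zip(p, delta))       # E[delta(U)] = Delta_d
    expectation = sum(pi * di * d ** (-di) for pi, di in zip(p, delta))
    j_formula = 0.5 * (redundancy * math.log(d) - math.log(d) / c * expectation)
    j_direct = 0.5 * sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
    return j_direct, j_formula, redundancy, delta


p = [0.5, 0.25, 0.15, 0.10]
lengths = [1, 2, 3, 4]            # Shannon code lengths for this source, d = 2
j_direct, j_formula, redundancy, delta = jeffreys_terms(p, lengths)

print(j_direct, j_formula)                          # the two evaluations agree
print(j_direct <= 0.5 * redundancy * math.log(2))   # upper bound for non-negative delta
```

The bound holds here because each term $\delta(u)\, d^{-\delta(u)}$ is non-negative when $\delta(u) \geq 0$, so the subtracted expectation can only decrease $J$.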

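Following Proposition 4, for an arbitrary $x \geq 0$, let $\varepsilon \triangleq \varepsilon(x)$
be the solution in the interval $[0,1)$ of the equation

$$\varepsilon \log \left( \frac{1 + \varepsilon}{1 - \varepsilon} \right) = x. \tag{18}$$

The combination of (8) and (17) implies that

$$\sum_{u \in \mathcal{U}} \big| P(u) - Q_{d,l}(u) \big| \leq 2\, \varepsilon\!\left( \frac{\Delta_d \log d}{2} \right). \tag{19}$$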
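The function $\varepsilon(x)$ has no closed form, but it is easy to evaluate. The sketch below solves $\varepsilon \log\frac{1+\varepsilon}{1-\varepsilon} = x$ on $[0,1)$ by bisection (the left-hand side is increasing in $\varepsilon$) and evaluates the resulting bound $2\,\varepsilon(\Delta_d \log d / 2)$ next to the Pinsker-based bound $\min\{\sqrt{2 \Delta_d \log d}, 2\}$. The function names and the example redundancy are illustrative.

```python
import math


def jeffreys_min(eps):
    """eps * log((1 + eps) / (1 - eps)), increasing on [0, 1)."""
    return eps * math.log((1.0 + eps) / (1.0 - eps))


def eps_of_x(x, iters=200):
    """Solve jeffreys_min(eps) = x for eps in [0, 1) by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid < 1.0 and jeffreys_min(mid) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def jeffreys_based_bound(redundancy, d=2):
    return 2.0 * eps_of_x(0.5 * redundancy * math.log(d))


def pinsker_based_bound(redundancy, d=2):
    return min(math.sqrt(2.0 * redundancy * math.log(d)), 2.0)


redundancy = 0.1                    # example redundancy in d-ary symbols per source symbol
print(jeffreys_based_bound(redundancy), pinsker_based_bound(redundancy))
```

Because $\varepsilon \log\frac{1+\varepsilon}{1-\varepsilon} \geq 2\varepsilon^2$ on $[0,1)$, one gets $\varepsilon(x) \leq \sqrt{x/2}$, so the first value never exceeds $\sqrt{\Delta_d \log d}$.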

In the following, the bounds in (12) and (19) are compared
analytically for the case where the average redundancy is small
(i.e., $\Delta_d \approx 0$). Under this approximation, the bound in (11)
(i.e., the original bound from [?]) coincides with its tightened
version in (12). On the other hand, since for $\varepsilon \approx 0$, the
left-hand side of (18) is approximately $2\varepsilon^2$, it follows from (18)
that, for $x \approx 0$, we have $\varepsilon(x) \approx \sqrt{x/2}$. It follows that, if
$\Delta_d \approx 0$, inequality (19) gets approximately the form

$$\sum_{u \in \mathcal{U}} \big| P(u) - Q_{d,l}(u) \big| \leq \sqrt{\Delta_d \log d}.$$

Hence, even for a small redundancy, the bound in (19)
improves on the bound in (11) by a factor of $\sqrt{2}$.

A numerical comparison of the bounds in (11), (12) and
(19) is provided in the journal paper, see [16, Figure 2].


Remark 4: Another application of Jeffreys’ divergence has
been recently studied in [1, Section 5] where the mutual
information $I(X;Y) = D(P_{X,Y} \| P_X P_Y)$ has been upper
bounded by the symmetrized divergence

$$D_{\mathrm{sym}}(P_{X,Y} \| P_X P_Y) \triangleq D(P_{X,Y} \| P_X P_Y) + D(P_X P_Y \| P_{X,Y}) = 2 J(P_{X,Y}, P_X P_Y).$$

Consequently, the channel capacity satisfies the upper bound
$C = \max_{P_X} I(X;Y) \leq 2 \max_{P_X} J(P_{X,Y}, P_X P_Y)$. This
provides a good bound on the channel capacity in the low SNR
regime (see [1, Section 5]). It has been applied in [1, Section 6]
to obtain a bound on the capacity of a linear time-invariant
Poisson channel; this bound is improved by increasing the
parameter of the background noise ($\lambda_0$) [1].


IV. A New Inequality Relating $f$-Divergences

We introduce in the following an inequality which relates
$f$-divergences, and its use is exemplified. This inequality is
proved here since it is not included in the journal paper [16].
Recall the following definition of the $\chi^2$-divergence.

Definition 8: The chi-squared divergence between two
probability distributions $P$ and $Q$ on a set $A$ is given by

$$\chi^2(P,Q) \triangleq \sum_{x \in A} \frac{\big(P(x) - Q(x)\big)^2}{Q(x)} = \sum_{x \in A} \frac{P(x)^2}{Q(x)} - 1. \tag{20}$$

The chi-squared divergence is an asymmetric $f$-divergence
where $f(t) = (t - 1)^2$ for $t \geq 0$.

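Proposition 5: Let $f\colon (0,\infty) \to \mathbb{R}$ be a convex function
with $f(1) = 0$ and further assume that the function
$g\colon (0,\infty) \to \mathbb{R}$, defined by $g(t) = -t f(t)$ for every $t > 0$, is
also convex. Let $P$ and $Q$ be two probability distributions on
a finite set $A$, and assume that $P$, $Q$ are positive on this set.
Then, the following inequality holds:

$$\min_{x \in A} \frac{P(x)}{Q(x)} \cdot D_f(P\|Q) \leq -D_g(P\|Q) - f\big(1 + \chi^2(P,Q)\big) \leq \max_{x \in A} \frac{P(x)}{Q(x)} \cdot D_f(P\|Q). \tag{21}$$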

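Proof: Let $A = \{x_1, \ldots, x_n\}$, and let $u = (u_1, \ldots, u_n) \in \mathbb{R}_+^n$
be an arbitrary $n$-tuple with positive entries. Define

$$J_n(f,u,P) \triangleq \sum_{i=1}^n P(x_i)\, f(u_i) - f\!\left( \sum_{i=1}^n P(x_i)\, u_i \right), \quad J_n(f,u,Q) \triangleq \sum_{i=1}^n Q(x_i)\, f(u_i) - f\!\left( \sum_{i=1}^n Q(x_i)\, u_i \right). \tag{22}$$

The following refinement of Jensen’s inequality appears in [7,
Theorem 1] for a convex function $f\colon (0,\infty) \to \mathbb{R}$, and it has
been extended in [2, Theorem 1] to hold for a convex $f$ over
an arbitrary interval $[a, b]$:

$$\min_{i \in \{1, \ldots, n\}} \frac{P(x_i)}{Q(x_i)} \cdot J_n(f,u,Q) \leq J_n(f,u,P) \leq \max_{i \in \{1, \ldots, n\}} \frac{P(x_i)}{Q(x_i)} \cdot J_n(f,u,Q). \tag{23}$$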



















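The refined version of Jensen’s inequality in (23) is applied in
the following to prove (21). Let $u_i \triangleq \frac{P(x_i)}{Q(x_i)}$ for $i \in \{1, \ldots, n\}$.
Calculation of (22) gives that

$$J_n(f,u,Q) = \sum_{i=1}^n Q(x_i)\, f\!\left( \frac{P(x_i)}{Q(x_i)} \right) - f\!\left( \sum_{i=1}^n Q(x_i) \cdot \frac{P(x_i)}{Q(x_i)} \right) = D_f(P\|Q) - f(1) = D_f(P\|Q), \tag{24}$$

$$J_n(f,u,P) = \sum_{i=1}^n P(x_i)\, f\!\left( \frac{P(x_i)}{Q(x_i)} \right) - f\!\left( \sum_{i=1}^n \frac{P(x_i)^2}{Q(x_i)} \right) \overset{(a)}{=} -\sum_{i=1}^n Q(x_i)\, g\!\left( \frac{P(x_i)}{Q(x_i)} \right) - f\!\left( \sum_{i=1}^n \frac{P(x_i)^2}{Q(x_i)} \right) \overset{(b)}{=} -D_g(P\|Q) - f\big(1 + \chi^2(P,Q)\big) \tag{25}$$

where equality (a) holds by the definition of $g$, and equality (b)
follows from equalities (2) and (20). The substitution of (24)
and (25) in (23) completes the proof. ■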
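The two identities used in this proof, and the sandwich inequality they feed into, can be checked numerically. The sketch below assumes the Jensen gap $J_n(f,u,W) = \sum_i W(x_i) f(u_i) - f(\sum_i W(x_i) u_i)$ with $u_i = P(x_i)/Q(x_i)$, and uses $f(t) = 1/t - 1$ as an example of a convex $f$ for which $g(t) = -t f(t) = t - 1$ is also convex. The function names and the distributions are illustrative.

```python
def jensen_gap(f, u, w):
    """J_n(f, u, W) = sum_i W(x_i) f(u_i) - f(sum_i W(x_i) u_i)."""
    mean = sum(wi * ui for wi, ui in zip(w, u))
    return sum(wi * f(ui) for wi, ui in zip(w, u)) - f(mean)


def f_divergence(f, p, q):
    """D_f(P||Q) = sum_x Q(x) f(P(x)/Q(x)) for strictly positive p and q."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))


def chi_squared(p, q):
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))


def f(t):
    return 1.0 / t - 1.0          # convex on (0, inf) with f(1) = 0


def g(t):
    return -t * f(t)              # equals t - 1, which is also convex


P = [0.5, 0.3, 0.2]
Q = [0.25, 0.25, 0.5]
u = [pi / qi for pi, qi in zip(P, Q)]

gap_q = jensen_gap(f, u, Q)       # compare with D_f(P||Q)
gap_p = jensen_gap(f, u, P)       # compare with -D_g(P||Q) - f(1 + chi^2(P,Q))
d_f = f_divergence(f, P, Q)
rhs = -f_divergence(g, P, Q) - f(1.0 + chi_squared(P, Q))

print(gap_q, d_f)
print(gap_p, rhs)
print(min(u) * gap_q <= gap_p <= max(u) * gap_q)
```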

As a consequence of Proposition 5, we introduce the following
inequality which relates the relative entropy,
its dual and the chi-squared divergence.

Corollary 1: Let $P$ and $Q$ be two probability distributions
on a finite set $A$, and assume that $P$, $Q$ are positive on $A$.
Then, the following inequality holds:

$$\min_{x \in A} \frac{P(x)}{Q(x)} \cdot D(Q\|P) \leq \log\big(1 + \chi^2(P,Q)\big) - D(P\|Q) \leq \max_{x \in A} \frac{P(x)}{Q(x)} \cdot D(Q\|P). \tag{26}$$

Proof: Let $f(t) = -\log(t)$ for $t > 0$. The function
$f\colon (0,\infty) \to \mathbb{R}$ is convex with $f(1) = 0$, and
$g(t) = -t f(t) = t \log(t)$ for $t > 0$ defines a convex function with
$g(1) = 0$. Inequality (26) follows by substituting $f$, $g$ in (21)
where $D_f(P\|Q) = D(Q\|P)$ and $D_g(P\|Q) = D(P\|Q)$. ■

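Remark 5: Inequality (26) strengthens the inequality

$$\chi^2(P,Q) \geq e^{D(P\|Q)} - 1 \tag{27}$$

which is derived by using Jensen’s inequality as follows [?]:

$$\chi^2(P,Q) = \sum_{x \in A} \frac{P(x)^2}{Q(x)} - 1 = \sum_{x \in A} P(x)\, e^{\log \frac{P(x)}{Q(x)}} - 1 \geq e^{\sum_{x \in A} P(x) \log \frac{P(x)}{Q(x)}} - 1 = e^{D(P\|Q)} - 1.$$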
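Both the two-sided inequality of Corollary 1 and the weaker Jensen-based inequality of Remark 5 can be checked on an example. The sketch below assumes natural logarithms and strictly positive distributions on a finite set; the function names and distributions are illustrative.

```python
import math


def kl(p, q):
    """Relative entropy D(P||Q) in nats for strictly positive p and q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def chi_squared(p, q):
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))


def corollary1_terms(p, q):
    """Return (lower, middle, upper) for the sandwich log(1 + chi^2) - D(P||Q)."""
    ratios = [pi / qi for pi, qi in zip(p, q)]
    dual = kl(q, p)
    middle = math.log(1.0 + chi_squared(p, q)) - kl(p, q)
    return min(ratios) * dual, middle, max(ratios) * dual


P = [0.5, 0.3, 0.2]
Q = [0.25, 0.25, 0.5]

lower, middle, upper = corollary1_terms(P, Q)
print(lower, middle, upper)
print(chi_squared(P, Q), math.exp(kl(P, Q)) - 1.0)
```

Since the lower term is non-negative, the middle term is non-negative as well, which is equivalent to $\chi^2(P,Q) \geq e^{D(P\|Q)} - 1$; the corollary adds a strictly positive margin whenever $P \neq Q$.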


The following inequality is another consequence of
Proposition 5, relating the chi-squared divergence and its dual:

Corollary 2: Under the same conditions of Corollary 1, the
following inequality holds:

$$\min_{x \in A} \frac{P(x)}{Q(x)} \cdot \chi^2(Q,P) \leq \frac{\chi^2(P,Q)}{1 + \chi^2(P,Q)} \leq \max_{x \in A} \frac{P(x)}{Q(x)} \cdot \chi^2(Q,P).$$

Proof: This follows from Proposition 5 where
$f(t) = \frac{1}{t} - 1$, and $g(t) = -t f(t) = t - 1$ for $t > 0$. Consequently,
we have $D_g(P\|Q) = 0$ and $D_f(P\|Q) = \chi^2(Q,P)$. ■
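As a sanity check of Corollary 2, the sketch below draws a few random pairs of strictly positive distributions and tests the two-sided inequality on each. The sampling scheme, seed, alphabet size and function names are illustrative assumptions; passing these checks is only numerical evidence, not a proof.

```python
import random


def chi_squared(p, q):
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))


def corollary2_terms(p, q):
    """Return (lower, middle, upper) for chi^2(P,Q) / (1 + chi^2(P,Q))."""
    ratios = [pi / qi for pi, qi in zip(p, q)]
    dual = chi_squared(q, p)
    chi = chi_squared(p, q)
    return min(ratios) * dual, chi / (1.0 + chi), max(ratios) * dual


def random_distribution(n, rng):
    weights = [0.05 + rng.random() for _ in range(n)]   # bounded away from zero
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
results = []
for _ in range(1000):
    p = random_distribution(4, rng)
    q = random_distribution(4, rng)
    lower, middle, upper = corollary2_terms(p, q)
    results.append(lower <= middle + 1e-12 and middle <= upper + 1e-12)

print(all(results))
```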